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Abstract 

We present a complete set of Fortran 90 modules that can be 
used to write very compact, efficient, and high level QCD programs. 
The modules define fields (gauge, fermi, generators, complex, and real 
\ fields) as abstract data types, together with simpler objects such as 

SU(3) matrices or color vectors. Overloaded operators are then de- 
fined to perform all possible operations between the fields that may 
be required in a QCD simulation. QCD programs written using these 
modules need not have cumbersome subroutines and can be very sim- 
ple and transparent. This is illustrated with two simple example pro- 
grams. 
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PROGRAM SUMMARY 



Title of program: QCDF90 

Computer for which the program is designed: Any computer 
Computers under which the program has been tested: Silicon Graphics In- 
digo and PowerChallengeArray, and IBM R6000 58H Installations: Boston 
University, Center for Computational Science and Department of Physics. 
Operating systems under which the program has been tested: IRIX 6.1, and 
AIX. 

Programming language used: Fortran 90 
No. of bits in a word: 64 

No. of lines in distributed program, including test data, etc: 7806 
Keywords: QCD, lattice gauge theory 

Nature of physical problem: Non-perturbative computations in QCD 
Memory required to execute with typical data: Varies according to the appli- 
cations. Scales proportionally to the lattice volume NX * NY * NZ * NT. 
On a 16^ lattice, the example codes quenched. f90 and propagator.f90 use 
approximately 110 Mbytes and 140 Mbytes respectively. 
Typical running time: Varies according to the applications. The example 
codes quenched.f90 and propagator.f90 take approximately 45 microsec to 
update an SU(3) link, 8 microsec to calculate a plaquette, and 20 microsec 
for a CG step per link, using a 16^ lattice, on an SGI Power-Challenge per 
node. 
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LONG WRITE-UP 



1 Introduction 

The computer simulation of quantum fluctuations (cf. for instance [@], 
[§]) has been one of the most powerful tools for obtaining information about 
the non-perturbative properties of quantum field theories in general, and, 
especially, of Quantum Chromodynamics (QCD) (good accounts of progress 
in this field of research can be found in the proceedings of the yearly interna- 
tional symposia on lattice gauge theories [Q; see also 0). These simulations, 
which deal with matrix and vector fields defined over a four- dimensional 
space-time lattice, involve huge number of variables and are very demand- 
ing in computer resources. Therefore, good payoffs can be obtained in this 
domain of applications from the development of highly-efficient code. On 
the other hand, even greater gains can be achieved through the invention 
of better algorithms, which is made much easier by the availability of high- 
level, structured programming tools. High-level programming tools are also 
invaluable for extracting physical results from the data collected in the sim- 
ulations, which typically requires experimenting with different types of data 
analysis and involves substantial amounts of code development. 

With the twofold goal of facilitating the development of algorithms and 
applications for lattice QCD, and of maintaining good code performance, we 
have taken advantage of the possibilities offered by Fortran90 to write a set 
of modules for a high-level, yet efficient implementation of QCD simulations. 
Our end product is described in this long write-up, whose main purpose is to 
provide researchers with all the information needed to use our modules. Since 
this effectively makes the long write-up a reference document, it is indeed, 
and necessarily, "long" . We have nevertheless striven to be concise, in order 
to save space and, especially, because we felt that a concise document would 
make it easier for the user to find the relevant information. Most of the times 
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the functionality provided by our modules will be obvious. For instance, if 
fl and f2 are two variables of type fermiJield (see later for the precise 
definition), f 1+f 2 will have as components the sum of the components of the 
two fields. Similarly, if gl and g2 are variables of type gaugeJield, gl*g2 will 
have as components the matrix products of the components of gl and g2. In 
other instances, however, we had to use a bit of creativity in adapting the 
symbols of the language to the definitions of some further useful overloaded 
operators. Thus, if f 1 and f2 are again variables of type fermiJield, f l//f2 
will be for us a variable of type gaugeJield having for components the dyadic 
formed by the vector components of the two fermiJields. For all these less 
obvious definitions, there is no substitute to reading the sections of this 
article, where all of our overloaded operators are carefully documented. 

We expect that most of the users of our modules will be practitioners of 
lattice gauge theory, and as such already quite knowledgeable about the type 
of variables that enter QCD simulations. Having in mind, however, that some 
of the users might be application scientists called on to benchmark code with 
which they are not too familiar, we decided to include in this write-up a very 
concise description of the data structures encountered in QCD simulations. 
This is presented in the next section, which summarizes the kinematics that 
has been used for lattice QCD since the pioneering work of Wilson ^j. The 
section that follows discusses the all-important notion of parallel transport 
in presence of a gauge field and our implementation of parallel transport via 
a generalization of the C-shift operation, which we call "U-shift" . Sections ^ 
1^ and ^ deal with algorithmic issues related to the ordering of the data, with 
the description of the data types, and with some further considerations of 
programming and efficiency. The remaining sections are devoted to a detailed 
description of our modules and of the functionality which they provide. 
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2 Geometry and variables 



We consider a four-dimensional lattice with extent NX, NY, NZ and NT in the 
four directions. We will assume that NX, NY, NZ, NT are all even. A lattice site 
will be labeled by four integer valued variables x, y, z, t with 

< X < NX, < y < NY, < z < NZ, < t < NT . (1) 

When convenient, wc will denote the collection of these four indices by x. 
We will assume periodic boundary conditions. 

The physical system is defined in terms of two types of variables (also 
called the dynamical variables): the gauge fields and the Fermi fields. 

The components of a gauge field arc 3x3 unitary, unimodular matrices 
(i.e. elements of the group SU{3), the so called "color" group) defined over 
the oriented links of the lattice. Later wc will sec that programming consid- 
erations demand a more involved layout of data, but, conceptually, a gauge 
field can be considered as a multidimensional array of complex variables 

U(3, 3, : NX - 1, : NY - 1, : NZ - 1, : NT - 1, 4) , (2) 

where the first two indices of the generic array element U(i, j, x, y, z, t, m) 
are the indices of the SU (3) matrix, whereas x, y, z, t label a lattice site and 
m = 1 ... 4 labels one of the four lattice hnks having origin at the site and 
oriented in the positive m direction. When convenient we will use the more 
compact notation f/^^^ to denote the gauge field elements, or to denote 
the whole matrix defined over the link (in this compact notation we follow 
the common practice of using a Greek letter to denote the direction of the 
link). Another useful notation consists in representing by /i a four- vector 
having its /i component equal to 1, all other components equal to zero. With 
this notation, one can say that the gauge variable is defined over the 
oriented link from x to x -|- /i. 
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The components of a Fermi field are defined over the sites of the lattice. 
They are 3-dimensional complex vectors with respect to the matrices of the 
color group and carry an additional spin index s ranging from 1 to 4. Thus 
the data layout of a Fermi field can be represented conceptually in terms of 
an array of complex variables 



where the first index of the generic array element f (i, x, y, z, t, s) is the color 
index, x, y, z, t label the site and s is the spin index. When convenient we 
will use a more compact notation ipi^x,s for the components of a Fermi field, 
or ip^ s for the whole color vector, or even just 

In the field theoretical definition of the physical system the components of 
a Fermi field would be anticommuting elements of a Grassmann algebra with 
integration. The rules of integration over elements of a Grassmann algebra 
ipa, i^b [o- and b stand for any complete set of indices) have, as their most 
important consequence, the formula 



In computational applications DetfA] or its derivatives with respect to the 
dynamical variables are calculated by means of ordinary complex variables 
(pa, <t>b making use of the identity 



f (3, : NX - 1, : NY - 1, : NZ - 1, : NT - 1, 4) , 



(3) 







Thus effectively one deals with arrays of complex variables as in (^). 
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3 The notion of U-shift 



The gauge field serves to define the transport of dynamical variables between 
neighboring sites. Gauge theories are characterized by the property of local 
gauge invariance. In the present context this means that it is always possible 
to redefine the Fermi variables by an SU (3) transformation 

j 

where the elements of the gauge transformation Gjj,x are 5'f/(3) matrices 
defined over the sites. All of the physical quantities must remain invariant 
under such transformations. 

It is clear that, if the Fermi fields transform according to @ with a Gjj,x 
that changes from site to site, a straightforward finite difference (as one would 
use in the approximation of a derivative) 

(A?/^)i,x,s = ^i,x+/i,s - V'i,x,s (7) 

will produce meaningless results. Rather, one should "transport" the variable 
T/^x+zi from the site x + /t to the site x by means of the gauge variable 
defining a shifted variable 

C?' = E f^4x^.,x+A,« (8) 
and then define a gauge covariant finite difference 

(D^),x,. = - ^^,^,s ■ (9) 

Under a gauge transformation the gauge field itself changes according to 

kl 



6 



From Eqs. (|^,^,|1^) one can verify that under a gauge transformation the 
gauge covariant finite difference changes hke ip itself: 

(i^^)^,x,. ^ {D^%,.,s = E G^JAD^),,^,s . (11) 

j 

Thus the gauge covariant finite difference is a meaningful construct and quan- 
tities such as its magnitude or the scalar product Si ''/'i,x,s(-D^/')i,x,s' are gauge 
invariant and thus physically well defined. 

It is clear from the above that a circular shift (C-shift) of an array such 
as f(3, : NX - 1, : NY - 1,0 : NZ - 1,0 : NT - 1,4) will generally be 
complemented by multiplication by an element of the gauge field. We will 
therefore define a U-shift operation in the following manner. 

A U-shift with positive direction parameter /i = 1 ... 4 of the Fermi field 
V'i.x.s produces the array ipf^^l'^'^ as given by Eq. (||). 

A U-shift with negative direction parameter fi' = = —1 ... — 4 of the 
Fermi field V^i^x.s produces the array 

C?" = E f^c-A^^™ , (12) 

j 

this latter equation being motivated by the fact that the transport factor 
over a link crossed in the negative direction is the Hermitian adjoint (or 
equivalently the inverse, with a unitary group t/^ = U~^) of the transport 
factor for the positively oriented link. 

We define a U-shift for the gauge field as well. Since the gauge field ele- 
ments have two color indices which should be associated with the beginning 
and end of the link (cf. the gauge transformation properties of the gauge field 
variables Eq. (p!0|)) the U-shift of a gauge field will involve two matrix mul- 
tiplications. Moreover, it will be convenient to define its action on a generic 
gauge field, denoted below by V, not necessarily identical to U. The idea is 
that in general there will be several variables with the properties of a gauge 
field (see the type definitions below) but there will always be one well defined 
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"master gauge field" , denoted by U, which will serve to define the transport 
of all gauge dependent variables. With this in mind the action of a U-shift 
on a gauge field is defined as follows. 

A U-shift with positive direction parameter /i = 1 ... 4 of the gauge field 
Vj^x produces the array 

^shifted,. ^ ^ jjt. yu f/tM _ (13) 
kl 

A U-shift with negative direction parameter /x' = — /i = —1 ... — 4 of the 
gauge field V^^.^ produces the array 

yShl{ted,V yu TTfl /-|.^ 

' i,x / J '-^ ik.-x — fi.' kl.-x—fi^ li.-x—u+v ■ V / 

kl 

When acting on field variables which carry no color index (we will define 
such field variables below) the U-shift reduces to an ordinary C-shift. 

4 Even and odd components of field variables 

All lattice sites can be subdivided into "even" and "odd" sites according to 
whether the sum of the integer valued coordinates x + y + z + t is even or 
odd (checkerboard subdivision). Correspondingly all field variables can be 
divided into even and odd variables (for a gauge field variable we base the 
subdivision on the origin of the link over which the variable is defined, i.e. 
the X, y, z, t indices of the array (0)). With periodic boundary conditions and 
with an even lattice size in all directions, a C-shift or a U-shift of an even 
field variable produces an odd field variable and vice versa. There are many 
algorithms which demand, especially in the context of a parallel implemen- 
tation, that even and odd variables be treated separately. For example, in a 
Monte Carlo simulation algorithm all variables at even sites can be upgraded 
simultaneously while those at odd sites are kept fixed and vice versa. We will 
accommodate these algorithmic demands by defining all of our field variables 
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as arrays of even or odd field variables. We will do so by taking advantage 
of the type definition as follows. All field variables will be defined through 
a type. The first component of the type will be an integer variable parity 
which will take values and 1 for variables defined over even and odd sites 
respectively. It will also be convenient to use the value —1 to characterize a 
field with parity undefined. 

For the gauge variables it will be convenient to include in the type a single 
yU component (of definite parity, of course). Thus, in addition to the variable 
parity, the type will contain an integer variable dir, taking values 1 ... 4, 
to denote the direction of the link (i.e. the value of the index /i). It will be 
convenient to let dir also take the value 0, to characterize 3x3 complex 
matrices which are defined over the sites rather than over the links, such as 
the SU{3) matrices of the gauge transformation in Eq. (P). 

Finally the type will then contain an array, denoted by f c (for field com- 
ponent) which will contain all the field variables defined over the sites of a 
given parity. 

Insofar as the indexing of the array is concerned, this is to a large ex- 
tent arbitrary, provided that the mapping between the array indices and the 
actual Cartesian coordinates of the site is well defined. For instance, one 
could collapse two neighboring "time" slices into a single one and use in- 
dices X, y, z,t where t ranges now from to NT2 — 1 = NT/2 — 1. On the 
other hand, with many architectures efficiency considerations recommend 
that the indices x, y, z, t be fused into a single index, spanning the range 
to NXYZT2 - 1 = NX * NY * NZ * NT/2 - 1. This will typically be the case 
when, because of a vectorized or superscalar architecture, the instructions 
are pipelined and longer arrays give rise to better performance. In principle 
a good optimizing compiler should recognize when the individual loops over 
the X, y, z and t indices can be fused into a single one and take advantage of 
this possibility. However, some compilers may be able to fuse only a limited 
number of nested loops or, alternatively, this type of optimization may be 
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hindered by the the presence of further indices or by a large number of in- 
structions within the loops. Since the use of types and overloaded operators 
makes the actual indexing of the arrays transparent to the user, we decided 
to use a single index to label all of the sites of a definite parity. This index 
is constructed by going through the sites of definite parity in a lexicographic 
order, increasing the x coordinate first, then y, z and t, but, as stated above, 
the ordering of the sites is largely immaterial. For all those operations which 
are performed locally over the sites, the detailed mapping between the in- 
dex and the geometry of the lattice is clearly irrelevant. It is, of course, of 
consequence for the implementation of the shift operations and for accessing 
the component of a field at a definite Cartesian site. For such purposes we 
provide the specifics of the mapping through some global variables and an 
initiahzation subroutine. We define the following global variables: 
INTEGER, DIMENSI0N(0:NX-1,0:NY-1,0:NZ-1,0:NT-1) :: xyzt_index 
INTEGER, DIMENSI0N(0:NX-1,0:NY-1,0:NZ-1,0:NT-1) :: xyzt_parity 
INTEGER, DIMENSI0N(0:NXYZT2-1, 0:1, 4) :: xyzt.cartesian 
INTEGER, DIMENSI0N(0:NXYZT2-1, 0:1, 8) :: xyzt_neighbor 
LOGICAL shift_initialized 

where the parameter NXYZT2 = NX * NY * NZ * NT/2 equals one half of the total 
number of sites in the lattice. 

The arrays defined above are initialized by executing the subroutine 
shif t_initialization. The variable shif t_initialized is initialized to 
.FALSE. All of the function calls which implement the overloaded shift opera- 
tors check the value of shif t_initialized. If this is .FALSE. , the subroutine 
shif t_initialization is called and the arrays are properly initialized. Be- 
fore returning, shif t_initialization sets shif t_initialized to .TRUE. 
From this moment on the arrays can be used to establish the mapping be- 
tween the Cartesian coordinates and the indices within the sublattices of 
definite parity. The programmer wishing to use these arrays before any shift 
operation is performed can, of course, initialize them directly via a call to 
shif t_initializat ion. 
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The array component xyzt_index(x,y,z,t) gives the index of the field 
component defined over the site with Cartesian coordinates x, y, z, t. 
xyzt_parity(x,y,z,t) gives the parity of the site (xyzt_parity(x, y, z, t) = 
x + y + z + tMOD 2). xyzt_cartesian(i,p,in) gives the Cartesian coordinate 
(x,y,z,t for m=l,2,3,4 respectively) of the site with index i and parity 
p. Finally xyzt_neighbor(i,p,in) gives the index of the nearest neighbor 
site in direction m of a site with index i and parity p. The convention is 
that the values in=l ,2,3,4 correspond to the nearest neighbor in the forward 
x,y,z,t directions, whereas ni=5,6,7,8 correspond to the nearest neighbor 
in the backward x, y, z, t directions, respectively. 

5 Types 

We define the following F90 types. 
5.1 Type gauge_field 

TYPE gauge_field 
INTEGER parity 
INTEGER dir 

COMPLEX (REALS) .DIMENSION (3, 3, 0:NXYZT2-1) : :f c 
END TYPE 

As discussed in the previous section, a variable of type gaugeJield contains 
the components of a gauge field defined over all the links of direction dir 
emerging from the lattice sites of a given parity. The field component 
fc(i, j,xyzt) is an array of double precision complex variables (the kind 
REALS is defined in the module "precisions", see below), where i, j are the 
indices of the SU (3) matrix and xyzt labels the site within the subset of 
sites of a definite parity. 
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5.2 Type full_gauge_field 



TYPE full_gauge_f ield 

TYPE(gauge_field) , DIMENSI0N(0:1,4) :: uc 
END TYPE 

A variable of type full_gauge jield is meant to store an entire gauge field con- 
figuration, i.e. 8 variables of type gaugeJield corresponding to the two parity 
components and the 4 direction components of a full gauge field. Although 
the parity and dir components of the individual uc(i, j) components can 
be given any value, good programming practice recommends that one sets 
uc(i, j)%parity = i, uc(i, j)%dir = j. 

5.3 Type fermi_field 

TYPE fermi_f ield 
INTEGER parity 

COMPLEX (REALS) , DIMENSION (3, 0:NXYZT2- 1,4) : :f c 
END TYPE 

A variable of type fermiJield contains the components of a Fermi field defined 
over the lattice sites of a given parity. The field component fc(i,xyzt, s) 
is an array of double precision complex variables, where i is the color index, 
xyzt labels the site and s is the spin index of the field. 

5.4 Type complex_field 

TYPE complex_f ield 
INTEGER parity 

COMPLEX (REALS) , DIMENSION (0:NXYZT2-1) : :fc 
END TYPE 

The type complex_field is introduced to store an array of complex numbers 
f c(xyzt) defined over the lattice sites of a given parity. Although one could 
also store such variables in an array of complex numbers, defining a type has 
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the advantage that one can record the parity of the field and that it becomes 
possible to define overloaded operators (intrinsic operations on intrinsic types 
cannot be overloaded) . A similar remark applies to the type realJield defined 
below. 



5.5 Type reaLfield 



TYPE real_field 
INTEGER parity 

REAL (REALS) , DIMENSION (0 : NXYZT2-1) : :fc 
END TYPE 

The type realJield is introduced to store an array of real numbers f c(xyzt) 
defined over the lattice sites of a given parity. 

5.6 Type generator_field 

TYPE generator_f ield 
INTEGER parity 
INTEGER dir 

REAL(REAL8) ,DIMENSI0N(8,0:NXYZT2-1) : :f c 

END TYPE 

Although for computational purposes it is useful to store the components of 
an SU{3) gauge field as 3 x 3 complex matrices, a general SU{3) matrix is 
a function of only 8 real independent parameters. In particular, given an 
8-dimensional real vector with components Vk one can associate to it the 
SU{3) matrix 



(15) 



exp {i VkX'') 

k=l 

where the matrices form a basis in the space of Hermitian traceless 3x3 
matrices and satisfy the equations Tr(A*^A*^') = for A; 7^ k', Tr(A^)^ = 2. 
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The term group generator is commonly used to refer to a traceless Her- 
mitian matrix, such as 

H^, = E v,X% (16) 

k=l 



in Eq. (]T5|). For some algorithms it is convenient to deal directly with the 
components Vk of a generator, rather with the exponentiated matrix U or the 
Hermitian matrix H . For this reason we provide the type generatorJield, 
aimed at storing generator components defined over the sites of a given 
parity. Since generators are frequently associated to gauge field variables, 
we give the type generatorJield a dir component as well. 



5.7 Type matrix 



TYPE matrix 

COMPLEX (REALS) .DIMENSION (3, 3) : imc 
END TYPE 

The type matrix is defined for programming convenience, in order to allow 
for the overloading of operators and assignments. For instance, it makes it 
possible to define an operation g * m, where the variables g and m are of type 
gaugeJield and matrix respectively, which implements the matrix product of 
the components of a gauge field times a constant matrix. 



6 Programming and efficiency considerations 

6.1 One layer versus two layer data structure 

Conceptually our variables would be most naturally defined in terms of a two 
layer data structure. At the bottom layer we would find objects such as a 
single SU (3) matrix or a single color vector, i.e. three dimensional complex 
vector. Overloaded operations such as matrix multiplication or multiplication 
of a matrix times a color vector would also be defined. At the top layer we 
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would then use these objects to define extended fields, such as the gauge field, 
consisting of an array of objects of type matrix. Operators among the objects 
of the top layer would be built from the elemental operators already defined 
at the bottom layer. However appealing, this organization of the data would 
almost certainly imply a huge penalty in efficiency. It is indeed reasonable 
to expect that the compiler will implement overloaded operations in terms 
of function calls. In a two layer structure, then, an operation such as the 
addition of two Fermi fields would be implemented via repeated calls, site by 
site, to the function which adds the color vector components of the two fields. 
It is clear that this use of function calls at very low granularity would imply 
a heavy computational burden. The only way to regain efficiency would be 
to inline the function calls implementing the elemental operations. While in 
principle this is possible, it is not reasonable to expect that compilers would 
generally allow inlining of function calls that implement operations among 
derived data types over which they have little direct control. For this reason 
we decided to forfeit the possibility of defining a two layer data structure, 
however conceptually pleasing this may be, and organized all of our data 
into a single layer of user defined types. Thus the types which we introduce 
to define extended fields are, essentially, F90 arrays complemented with one 
or two variables (parity, dir) specifying their attributes. As a consequence 
the computational cost for the use of overloaded operators between our data 
structures should not be any bigger than the cost of a call to a function or 
subroutine that manipulates large arrays. On the other hand, the advantages 
we gain in code structure and ease of programming are truly remarkable. 

6.2 Overloaded assignments 

The use of overloaded operators may imply the creation of more temporaries 
and, consequently, more motion of data than a straightforward implementa- 
tion of operations among arrays. Consider for example the following opera- 
tion among variables of type fermi_field: 

f 1 = f 1 + f2 + f3 . (17) 
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(We will formally define the addition of Fermi fields later, but it will perform 
the obvious operation of adding the f c components of the fields.) 

With ordinary arrays the compiler might put the result of f 1 + f2 in 
a temporary tl and then add tl and f3 placing the result in f 1. Thus 
there would be two write-to-memory operations per component of the arrays. 
(A good optimizing compiler could even use registers, dispensing with the 
creation of the temporary and of one of the copies to memory.) However, if 
the overloaded addition of Fermi fields is implemented via function calls, what 
we expect to happen is that the function implementing f 1 + f 2 places the 
result into a temporary tl returning the address of the corresponding data 
structure to the calling program. The compiler at this point will probably 
copy tl into a temporary t2, since it would not be safe to pass the addresses 
of tl and f3 to the add function which will likely put the result into tl 
again. Finally, the result will be copied into fl. If implemented in this 
manner, the entire operation involves four write-to-memory operations: to 
tl, to t2, to tl again and to f 1. (Of course, all of the above is implementation 
dependent. As far as we know, F90 does not specify how the variables should 
be passed in function calls. An operating system could let the calling program 
pass to a function the address where it expects the result, making the call 
a = function(b, c) effectively identical to CALL subroutine(a, b, c). In this 
case the composite operation (0) could be implemented with two copies to 
memory only.) 

The procedure could be drastically simplified through the use of an over- 
loaded assignment -|- =. Instruction ([T^) could be written 

fl + = f2 + f3 , (18) 

which the compiler would implement by issuing first a call to a function 
that adds f2 and fS returning the result in tl. The addresses of f 1 and 
tl would then be passed to a subroutine, e.g. plus_eq(a, b) that implements 
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the operation fl = fl+tl among the components of the data types. The 
required number of copies to memory would be only two. 

In order to allow for these possible gains in efficiency, we have defined a 
large set of overloaded assignments, which will be detailed in the description 
of the module "assign" given below. Since F90 permits only the use of the 
= symbol for the assignment, we have implemented its overloading by defin- 
ing two global variables: a character variable assign_type and an integer 
variable assign_spec (for assign specification, introduced to accommodate 
assignments of a more elaborate nature). The default values of these vari- 
ables are ' =' and 0. They are initialized with these values and reset to their 
default values at the end of all overloaded assignments. We follow this pro- 
cedure to avoid the occurrence of accidental erroneous assignments. When 
assign_type equals ' =' the result of the assignment between variables of 
identical type is the expected copy of the data structure at the r.h.s. into the 
variable at the l.h.s.. (We also define overloaded ' =' assignments between 
variables of different type; the results of such assignments are explained in 
the description of the module "assign".) Overloaded assignments such as 
a + = b are obtained by setting assign_type (and possibly assign_spec) to 
the appropriate value immediately before the assignment. We recommend 
the following pattern for the instructions: 

assign_type =' -|-'; a = b (19) 

or (this implements a U-shift from direction n) 

assign_type =' u'; assign_spec = n; a = b (20) 

The overloaded assignments are implemented via case constructs, which make 
reference to the values of the global variables assign_type, assign_spec. A 
simplified version of the code for an assignment would be as follows: 

SUBROUTINE typea_eq_typeb(a,b) 
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TYPE(typea), INTENT (INOUT) :: a 
TYPE(typeb), INTENT (IN) :: b 
SELECT CASE(assign_type) 
CASE('=') 

implements a=b 
CASE(^ + 

implements a=a+b 
CASE DEFAULT 

returns cin error message and stops execution if the value 
of assign_type does not correspond to any defined assignment 
END SELECT 

assign_type='=' ; assign_spec=0 
END SUBROUTINE typea_eq_typeb 

We wish to emphasize that the structure of data and operations which we 

have introduced may still cause loss of efficiency with some compilers, even 

with an optimizing one. It might happen that code performing the same 

calculations as a code written in terms of our data structures, but formulated 

without use of any derived data types, is converted, upon compilation, into 

a more efficient executable. However, we designed our data structure and 

defined our operators and assignments in a way which should present no 

barrier to a highly efficient, parallelizing compilation. It will be an interesting 

experiment to verify how different compilers respond to it. 

7 Modules 

7.1 Module precisions 

This module defines two kind parameters, REALS and LONG. These param- 
eters store the kind of an 8-byte floating point variable and of an 8-byte 
integer variable. They are used to render the kind definitions machine inde- 
pendent. INTEGER(LONG) variables are used only for the parallel generation of 
pseudorandom numbers in a system independent way (cf. the module "ran- 
dom_numbers" ) . If 8-byte integers are not supported by the architecture, the 
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module random_numbers should be modified to run with shorter integers or 
to use system supplied parallel pseudorandom numbers, and the definition 
of LONG should be changed accordingly. 

7.2 Module globaLmodule 

This module defines the integer constants NX, NY, NZ and NT which specify 
the size of the lattice. NX, NY, NZ, NT must all be even. It defines the reduced 
temporal extent NT2 = NT/2, and the products NXYZT = NX * NY * NZ * NT, 
NXYZT2 = NX * NY * NZ * NT2. It also defines for convenience the constants 
NCGV = 9 * NXYZT2, NCFV = 12 * NXYZT2, NRGV = 2 * NCGV, NRFV = 2 * 
NCFV, NRGEV = 8 * NXYZT2, which are equal to the number of complex or, 
respectively, real variables in the fc components of the types gauge_field, 
fermi Jield and generator_field. 

All of the types introduced in Sect. |^ are declared in this module. 

Finally the module declares a few global variables, namely, the master 
gauge field: 

TYPE(full_gauge_f ield) u 

the assignment variables (cf. Sect. |6.2| ): 

CHARACTER assign_type 

INTEGER assign_spec 

the arrays xyzt_index, xyzt_parity, xyzt_cartesian, xyzt_neighbor and 
the logical variable shif t_initialized, already mentioned in Sect. ^, 
the context logical array, used in conditional operations (cf. the module "con- 
ditionals"): 

LOGICAL, DIMENSION(0 : NXYZT2 - 1) :: context 

and the variables used for the generation of pseudorandom numbers (see the 
module "random_numbers" ) : 
INTEGER seed_a, seed_b 

INTEGER, DIMENSION(0 : NXYZT2 - 1) :: seeds 

The module contains the subroutine shif t_initialization (see Sect. 
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7.3 Module constants 



This module defines some useful parameters, making them available to all 
program units which use it. Namely, the following real constants are defined: 
PI (tt), PI2 (7r/2), TWOPI (27r), SQRT2 (v^), SQRT22 (^2/2), SQRT3 (^3), 
SQRT33 (v/3/3), TWQSQRT33 {2y/Z/?>), the complex constant lU (z), and the 
arrays: 

C0MPLEX(REAL8), DIMENSI0N(3, 3) :: ZEROjn, UNIT, lUjn 
C0MPLEX(REAL8),DIMENSI0N(3) :: ZERO_v 
REAL(REAL8),DIMENSI0N(8) :: ZERO_ge 
C0MPLEX(REAL8),DIMENSI0N(3,3,8) :: LAMBDA 
C0MPLEX(REAL8), DIMENSI0N(4, 4, 5) :: GAMMA 

UNIT and IU_m are set equal to the unit matrix, and to i times the unit 
matrix, respectively. ZERO_m, ZERO_v, ZERO_ge have all components equal to 
zero. The array LAMBDA stores the components of the A matrices: 
LAMBDA(i,j,k) = Af,^., 

and the array GAMMA stores the components of Dirac's 7 matrices, 
GAMMA(sl, s2, m) = 7^,s2> = 1 . . . 5 , in our chosen representation. 
(We follow the convention 7^ = 7^7^7^7^-) 

Notice that we do not make any distinction between upper and lower 
indices for the A and 7 matrices: A'^ = A^, 7"^ = 7^ and the use of upper or 
lower indices is only dictated by notational convenience. 

7.4 Module field_algebra 

This module defines several overloaded operators that perform arithmetic 
operations between fields and other variables. We describe here all the op- 
erations which are defined. For conciseness we introduce notational conven- 
tions. We use the symbols g, u, f , c, r, ge and m to denote variables of 
type gaugeJield, fulLgaugeJield, fermi_field, complexJield, realJield, gener- 
atorJield and matrix, respectively, and the symbols complex and real to 
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denote a complex or real variable of kind REALS (cf. Sect. [7]^). When neces- 
sary, we will use subscripts, e.g. fi,f2, to distinguish between two variables 
of the same type. 

All operators obey the following general rules: 

i) When the result of the operation is a field, if the two operands have a 
parity component, the parity of the result is the parity of the operands if 
they have the same parity, otherwise it is undefined (i.e. = —1). If a single 
operand has a parity component, then the parity of the result takes the 
same value. A similar rule applies to the direction component of the variables 
of type gaugeJield and generatorJield: if both operands have the same dir 
or a single operand carries a dir component, then the dir component of the 
result is set to this value. Otherwise it is set to 0. 

ii) When the operator acts between fields, the operation is performed site 
by site and the result is again a variable of field type. When the operator acts 
between a variable of type field and a global variable (i.e. m, complex and 
real) the site variable is combined with the global variable. For example, 
the operations c = Ci + C2 and c = Ci + complex would be implemented as 

DO xyzt=0,NXYZT2-l 

c°/of c(xyzt)=cl%fc(xyzt)+c2%f c(xyzt) 
END DO 

and 

DO xyzt=0,NXYZT2-l 

c°/of c(xyzt)=cl%f c (xyzt)+complex 
END DO 

respectively. 

The following operations are defined and have the obvious meaning, im- 
plicit in the symbol: 

gl+g2, gl-g2, gl*g2, g*f, f*g, g*C, C*g, g/c, 

g*r, r*g, g/r, g + m, m + g, g - m, m - g, g * m, m * g, 

g * complex, complex * g, g/complex, g*real, real * g, g/real; 
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fi + f2, f 1 — f2, f * c, c * f , f/c, f * r, r * f , f/r, f * m, m*f, 
f * complex, complex* f, f/complex, f * real, real * f , f/real; 

Ci + C2, Ci - C2, Ci * C2, Ci/c2, c + r, r + c, c - r, r - c, 
c * r, r * c, c/r, r/c, c + complex, complex + c, c — complex, 
complex — c, c * complex, complex * c, c/complex, complex/c, 
c + real, real + c, c — real, real — c, c * real, real * c, 
c/real, real/c; 

ri + r2, ri - r2, ri * r2, ri/r2, r + real, real + r, 
r — real, real — r, r * real, real * r, r/real, real/r; 

gei + ge2, gei-ge2, ge * r, r * ge, ge/r, ge * real, 
real * ge, ge/real; 

mi+m2, mi — m2, mi*m2, m* complex, complex * m, m/complex, 
m * real, real * m, m/ real; 

Wc do not provide any clarification about the operations listed above (it 
would be truly superfluous) but for the observation that the symbol * implies 
matrix multiplication when acting between operands of type gaugeJield or 
matrix, and matrix by vector or vector by matrix when one of the operand is 
a fermiJield and the other a gaugeJield or a matrix. Notice that there is no 
implicit complex conjugation of the vector at the r.h.s. of a vector by matrix 
multiplication, i.e. f = f i * m translates into 

DO s=l,4 

DO xyzt=0,NXYZT2-l 
DO i=l,3 

f %f c (i , xyzt , s) =f l°/of c (1 , xyzt , s) *myomc ( 1 , i) & 

+f l%f c (2 , xyzt , s) *m%mc (2 , i) +f l%f c (3 , xyzt , s) *m%mc (3 , i) 

END DO 
END DO 
END DO 

The following additional operations have a special meaning: 
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gi/g2 : 

the gauge field gi is multiplied, site by site, by the Hermitian adjoint of 
the gauge field g2 (the notation is motivated by the fact that, with unitary 
matrices, the Hermitian adjoint of a matrix is also its inverse; however, there 
is no restriction that the variables stored in a gauge field must represent 
unitary matrices). 

m/g2 and gi/m : same as above, but with m a matrix rather than a gauge 
field. 

gi//g2 : the Hermitian adjoint of the gauge field gi is multiplied, site by 
site, by the gauge field g2. 

gi//m and m//g2 : same as above, but with m a matrix rather than a gauge 
field. 

f /g and g//f : the Fermi field f is right or left multiplied, site by site, by 
the Hermitian adjoint of the gauge field g. 

f/m and ni//f : same as above, but with m a matrix rather than a gauge 

field. 

f 1 * f2 : 

this operation returns a complex field having as site components the scalar 
product, taken over the color and the spin indices, of the complex conjugate 
of f 1 and ±2- Explicitly, c = f i * f2 would be implemented as 

DO xyzt=0,NXYZT2-l 

c%fc(xyzt)=0 
DO s=l,4 
DO i=l,3 

c°/of c(xyzt)=cy„f c(xyzt) & 

+CONJG(f l%f c(i ,xyzt , s) ) *f 2%f c (i ,xyzt , s) 

END DO 

END DO 
END DO 

f i//f2 : 

this operation returns a variable of type gaugejield having as site components 
the dyadic (over the color indices) of f i and the complex conjugate of f 2. The 
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spin indices are summed over. Explicitly, g = fi//f2 would be implemented 
as 

DO xyzt=0,NXYZT2-l 
DO i=l,3 
DO j=l,3 

g%f c ( i , j , xyzt ) =f ly.f c ( i , xyzt , 1 ) *CON JG (f 2y.f c ( j , xyzt , 1) ) 
DO s=2,4 

g'/of c(i, j ,xyzt)=gy„f c(i, j ,xyzt) & 

+f l%f c(i ,xyzt , s) *CONJG(f 2%f c ( j ,xyzt , s) ) 

END DO 
END DO 
END DO 
END DO 

gei * ge2 : this operation returns a real field having as site components the 
scalar product of the site components of the generators. 
g1.D0t.g2 : this operation returns a real field having as site components the 
the real part of the trace of the product of the Hermitian adjoint of the site 
components of the gauge field gi with the site components of the gauge 
field g2. 

The following named operators are also defined: 
i. Gamma. f , where i is a scalar integer. This operation returns a Fermi field 
having as site components the product of a single 7 matrix or of a pair of 7 
matrices times the site components ip^ of the Fermi field f . Our convention 
is as follows. The integer variable i can take value 1 through 5 or value 
10*ii+i2, where ii and ^2 can again range from 1 to 5. In the former case the 
operator implements the product 7j'?/'x- In the latter case the pair ii, 12 stands 
for two indices labeling a matrix 7^1 ^2, where 7^1 ^2 = f [TuTw - 7«7u], lib = 
= lil5 with i,ii,i2 = 1 . . .4, and the operator implements the product 
7iij2'?/'x- Thus, for instance, i = 25; f 1 = i.Gamma.f 2 would implement ipi^^ = 
7275'(/'2x- Products of 7 matrices have been explicitly incorporated in the 
definition of the .Gamma, operator because they are frequently encountered in 
the evaluation of matrix elements of fermionic variables. 
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f .GcLmma.i , where i is a scalar integer. This operation returns a Fermi field 
having as site components the product of site components of the Fermi field 
f times a single 7 matrix or of a pair of 7 matrices, following the same 
convention about the values of i as above. 

i. Lambda. g , where i is a scalar integer. This operation returns a gauge 
field having as site components the product of the matrix Aj times the site 
components of the gauge field g. 

g.Lambda.i , where i is a scalar integer. This operation returns a gauge field 
having as site components the product of the site components of the gauge 
field g times the matrix Aj. 

In addition we define the following unary operators: 
.1., .Minus., .Conjg., .Adj., .Ctr. .Tr. .Sqrt. and .Exp. 
When acting on a variable of type gaugeJield, fermi _field or complex_field 
.1. returns i times the variable. When acting on a variable of type real_field it 
returns a complex field given by i times the real field. This is introduced for 
efficiency, since the operator is implemented by switching real and imaginary 
parts with the appropriate change of sign, rather than through a complex 
multiplication. 

When acting on a variable of type gaugeJield, fermi_field, complex_field, 
real_field or generator_field, .Minus, returns the negative of the variable. 

When acting on a variable of type gaugeJield, fermi_field, complex Jield or 
matrix .Conjg. returns the complex conjugate of the variable, i.e. a variable 
whose complex components are the complex conjugate of the original one. 

When acting on a variable of type gaugeJield or matrix .Adj . returns the 
Hermitian adjoint of the variable. 

When acting on a variable of type gaugeJield or matrix .Ctr. returns a 
complex_field or complex number, respectively, equal to the trace (at each 
site in the case of a field) of the operand. 

When acting on a variable of type gaugeJield or matrix .Tr. returns a 
realJield or real number, respectively, equal to the real part of the trace (at 
each site in the case of a field) of the operand. 
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When acting on a variable of type realJield .Sqrt. returns a realJield 
having as site components the square root of the absolute value of the site 
components of the operand. At the same time the global variable context is 
set to .TRUE, at all sites where the operand is non-negative and to .FALSE, at 
all other sites. 

When acting on a variable of type realJield .Exp. returns a realJield 
having as site components the exponential of the site components of the 
operand. 

7.5 Modules assignjsotypel, assign_isotype2, 
assign_isotype3 and assign_mixed 

These modules define the normal assignment and a variety of overloaded 
assignments which are defined for efficiency (cf. Sect. |6.2| above) and pro- 
gramming convenience. They are presented as four separate modules (as- 
signjsotypel, assignJsotype2 and assignJsotypeS define assignments between 
variables of the same type, assign_mixed between variables of different type) 
to reduce the overall length of the individual modules. We reproduce here all 
the available assignments. We use the notational conventions we introduced 
at the beginning of Sect. |7^ . Namely, we use the symbols g, u, f , c, r, ge and 
m to denote variables of type gaugeJield, fulLgaugeJield, fermiJield, com- 
plexjield, realJield, generator Jield and matrix, respectively, and the sym- 
bols complex and real to denote a complex or real variable of kind REALS 
(cf. Sect. |7J| ). Also, we use subscripts, e.g. f i, f 2, to distinguish between two 
variables of the same type. 

When the assignment relies on the the fact that the global variables 
assign_type and assign_spec have a value different from their default val- 
ues ' =' and 0, we will denote this fact by the using the combined symbols 
assign_type(assign_spec) = to denote the assignment. For example, we 
would use fi + = f2 or fi U(2) = f2 to denote assignments which in the 
actual coding would be implemented as 
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assign_type='+' ; f l=f2 , or 

assign_type='U' ; assign_spec=2; fl=f 2 , respectively. 

A general rule is that all assignments set the global variables assign.type 
and assign_spec equal to their default values ' =' and 0, no matter what 
the assignment does. As discussed in Sect. |6.2| , this is done in order to avoid 
the accidental use of erroneous assignments. 

For the parity component, the rule is that, if the destination is not an 
operand in the assignment (i.e. it is a variable with strict INTENT(OUT)), the 
parity component (if present) of the variable at the l.h.s. of the assignment 
(destination) is set equal to the parity of the variable at the r.h.s. of the 
assignment (source), or set to —1 if the source has no parity. Similarly, 
when the destination is not an operand in the assignment and has a dir 
component, this is set equal to the dir of the source or to if the source has 
no dir. An exception to the rule above about the parity component occurs 
with the assign_type =' u', assign_type =' w' and assign_type =' x' 
assignments, which copy into the destination a shifted source. In this case, 
if the parity of the source is defined, the parity of the destination is set to 
the opposite value. 

If the destination is an operand in the assignment (i.e. it is a variable with 
INTENT(INOUT)) parity and dir are treated in a manner similar to what 
happens in the definition of the operators implemented by the overloaded 
assignment. Typically, if the destination and the other operand have the 
same parity, this is preserved, otherwise the parity of the destination is 
set to —1 (undefined). An exception is found in the assignments U = and W = 
which implement the sum of the destination with a shifted operand, in which 
case the parity of the destination is preserved if the other operand has the 
opposite parity (as is the case in a geometrically meaningful operation) and 
is returned undefined otherwise. 

In what follows we list all of the available assignments and define their 
action, appending a few words of explanation when appropriate. When the 
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assignment is not followed by further clarifications, it means that it is a 
straightforward assignment (with assign_type ' =') copying the content of 
the source into the destination. Also, whenever the assignment implements 
operations which can be performed by using overloaded operators, we illus- 
trate its action simply by reformulating it in terms of these operators. We 
refer to the sections detailing the modules where the overloaded operators 
are defined for clarification of their action. 

The assignments are listed in order of destination type, first, and then 
of source type. The ordering of the types is the same as their order of 
introduction in Sect. |. 

Available assignments: 

gl = g2 

gl + = g2 (gl = gl + g2) 

gl - = g2 (gl = gl - g2) 

gl * (0) = g2 (gl = gl * g2) 

gl * (-1) = g2 (gl = g2 * gl) 

Notice how the assign_spec variable is used, above and immediately 
below, to specify the order of the operands in the non- commutative matrix 



multiplication. 






gl /(O) = g2 


( 


;gi = gi/g2) 


gl /(-I) = g2 




(gl = g2//gl) 


gl u(dir) = g2 




(gl = dir.ushift.g2) 


gl U(dir) = g2 




(gl = gl + (dir.ushift.g2)) 


gl t = g2 ( 


;gi 


= g2 where context is .TRUE.) 


gl f = g2 ( 


;gi 


= g2 where context is .FALSE.) 


gl A = g2 ( 


;gi 


= •Adj.g2) 


gl C = g2 ( 


;gi 


= •Conjg.g2) 


gl I = g2 ( 


;gi 


= •I-g2) 


gl M = g2 ( 


;gi 


= . Minus. g2) 


g = u (u%uc( 


g%parity, g%dir) is copied to 
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g t = u (same as above, but only where context is .TRUE.) 

g f = u (same as two lines above, but only where context is .FALSE.) 

g = ge (g = .Matrix.ge) 

g E = ge (g = .Exp.ge) 

g = m (all elements of g are set equal to m) 

g * (0) = m (g = g * m) 

g * (-1) = m (g = m*g) 

g * = complex (g = g * complex) 

g * = real (g = g * real) 

g / = complex (g = g/ complex) 

g/ = real (g = g/real) 

^ = g (g%f c is copied to u%uc(g%parity, g%dir) ) 

u t = g (same as above, but only where context is .TRUE.) 

u f = g (same as two lines above, but only where context is .FALSE.) 

Ui = U2 

f * (0) = g (f = f * g) 

f *(-l) = g (f = g*f) 

f /(O) = g (f = f /g) 

f/(-l) = g (f = g//f) 

(Note the function played by assign_spec in the four preceding assign- 



ments.) 








fi 


= f2 








fi 


+ =f2 


(fi 


= fl + 


f2) 


fi 


- = f2 


(fi 


= fl- 


f2) 


fi 


u(dir) = 


f2 


(fl = 


dir.Ushift.f2) 


fi 


U(dir) = 


f2 


(fl = 


f 1 + (dir.Ushift.f2)) 


fi 


w(dir) = 


f2 


(fl = 


dir.Wshift.f2) 


fi 


W(dir) = 


f2 


(fl = 


f 1 + (dir.Wshift.f2)) 


fi 


x(dir) = 


f2 


(fl = 


dir.Xshift.f2) 


fi 


X(dir) = 


f2 


(fl = 


f 1 + (dir.Xshift.f2)) 
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f 1 c = 


= f2 


( 


;f 1 = .Conjg.fa) 


fl I = 


= f2 


( 


;fi = .i.f2) 


f 1 M = 


= f2 


( 


'f 1 = .Minus. £2) 


f * = 


C 1 


[i 


= f * c) 


f / = 


C 1 


:f 


= f/c) 


f * = 


r 1 


[f 


= f * r) 


f / = 


r 1 


:f 


= f/r) 


f * = 


complex 


(f = f * complex 


f * = 


real 




(f = f * real) 


f / = 


complex 


(f = f / complex) 


f / = 


real 




(f = f/real) 


c = g 


(c 




■ -Ctr.g) 



Cl = C2 
Ci + = C2 
Cl — = C2 
Cl * = C2 
Cl / = C2 

Cl C = C2 
Cl M = C2 

Cl I = C2 

c = r 
c + = r 
c — = r 
c * = r 
c / = r 
c M = r 
c = complex 

c + = complex (c = c + complex) 
c — = complex (c = c — complex) 
c * = complex (c = c * complex) 



(Ci = Cl + C2) 
(Ci = Cl - C2) 
(Ci = Cl * C2) 
(Ci = C1/C2) 

(ci = .Conjg.C2) 
(ci = .Minus. C2) 
(ci = .I.C2) 

(c = c + r) 

(c = c — r) 
(c = c * r) 
(c = c/r) 
(c = . Minus. r) 
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c / = complex (c = c/complex) 

c M = complex (c = .Minus. complex) 



c = real 



c + 




: real 


(c = 


: c + real) 


c — 




: real 


(c = 


: c — real) 


c * 




real 


(c = 


c * real) 


c/ 




real 


(c = 


c/real) 


c M 




real 


(c = 


.Minus. real) 


r = 


g 


(r = 


.Tr.g 


) 


r = 


f 


fr = 


f * 





r = c the elements of r are set equal to the real part of the elements of c 
ri = r2 

ri + = T2 (ri = ri + rs) 

ri - = rs (ri = ri - ra) 

ri * = rs (ri = ri * ra) 

ri / = r2 (ri = ri/r2) 

ri M = r2 (ri = . Minus. r2) 

ri R = r2 (ri = .Sqrt.ra) 

ri E = r2 (ri = .Exp.r2) 
r = real 

r + = real (r = r + real) 

r — = real (r = r — real) 

r * = real (r = r * real) 

r / = real (r = r/real) 

r M = real (r = .Minus. real) 
ge = g (ge = . Generator. g) 
gei = ge2 

gei + = ge2 (gei = gei + ge2) 

gei - = ges (gei = gei - ge2) 

gei M = ge2 (gei = .Minus. gea) 
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gei S = ge2 (gei = .Sq.gea) 

ge * = r (ge = ge * r) 

ge / = r (ge = ge/r) 

ge * = real (ge = ge * real) 

ge / = real (ge = ge/real) 

The following assignments perform global reductions, either absolute or 
restricted to the lattice sites where context is .TRUE, or .FALSE.: 
complex = c (complex = X^xyzt c(xyzt)) 
complex t = c (complex = EwHERE(context(xyzt)) c(xyzt)) 

complex f = C (complex = EwHERE(.NQT.context(xyzt)) c(xyzt)) 

real = c (real = Exyzt Real[c(xyzt)]) 

real t = c (real = EwHERE(context(xyzt)) Real[c(xyzt)]) 

real f = c (real = EwHERE(.NOT.context(xyzt)) Real[c(xyzt)]) 

real = r (real = Exyzt r(xyzt)) 

real t = r (real = EwHERE(context(xyzt)) r(xyzt)) 

real f = r (real = EwHERE(.NOT.context(xyzt)) r(xyzt)) 

7.6 Module shifts 

This module defines the operators .Cshift., .Ushift., .Wshift. and .Xshif t. 
The left operand for all these operators is an integer m which must take 
one of the values 1, 2, 3, 4, —1, —2, —3, —4 and specifies the direction and 
orientation of the shift. The right operand can be any variable of field type 
for .Cshift.. It can be any variable of field type with the exception of the type 
generator _field for .Ushift., while it must be a variable of type fermi_field for 
.Wshift. and .Xshif t. The parity component of the right operand must be 
defined, i.e. take value or 1. If the parity is not defined or if m does not take 
one of the values specified above the function call implementing the operator 
returns an error message and stops the program. All of these operators return 
a field variable of the same type as the left operand and opposite parity. If 
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the right operand is of the type gauge_field or generator _field and thus has 
also a dir component, this is passed to the resuh unchanged. 

.Cshif t. implements an ordinary C-shift of the field, but with respect to 
the Cartesian geometry of the lattice. This is why the parity is interchanged. 
Given a site with Cartesian coordinates x in the sublattice of the parity of 
the result, the operator copies into the corresponding element of the result 
the element of the right operand which is defined over the lattice site x + s/i, 
where s is the sign of m and fi = abs(m). 

.Ushift. moves the data in a manner similar to . Cshif t., but with the 
inclusion of the appropriate transport factors, defined in terms of the global 
field U (cf. "global module" ). For the variable of type gaugeJield and fermi_field 
the U-shift operation has been defined in Sect. ^ (cf. Eqs. (p!3D, (Q), and 



Eqs. (|[), (pr2D, for gauge fields and Fermi fields, respectively). The action 
of a U-shift on variables of type complex_field and realJield reduces to a C- 
shift. The U-shift of a generator field is not normally encountered in QCD 
simulations and for this reason it is not explicitly implemented here. It can 
be implemented by using the functionality provided by the module genera- 
tor_algebra to re-express the generator field as a field of hermitian matrices 
(i.e. of type gaugeJield), shifting the latter, and converting it again to a 
generatorJield. 

The operator .Wshift. acts only on Fermi fields and it is a combination 
of a .Ushift. and the product with a 7 matrix. Precisely, if we again define 
s to be the sign of m and define /i or mu to be the absolute value of m, then 
the operation f 2 = m.Wshif t.f 1 is equivalent to 
f 2=m . Ushift . f 1-s* (mu . Gamma . (m . Ushift . f 1 ) ) 

Equivalent ly 

/2,x = (l-7M)f^x^/l,x+A (21) 

for positive m, and 

/2,x = (l + 7^)f/IVw (22) 
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for negative m. 

The operator .Xshift. also acts only on Fermi fields and it is equivalent 
to a W-shift bracketed by two matrices 75 (where 75 = 71727374): 



for positive m, and 



/2,x = 75(l-7M)75t/,^/i,x+A (23) 
/2,x = 75(1+7;.)75^^'1a/i,x-a (24) 



for negative m. 

The reason for the explicit introduction of the W-shift and X-shift op- 
erators is that the combinations ( pl| - |2^) appear in the Wilson discretization 



of the Dirac operator, which is a very widely used lattice Dirac operator, 
and related algebra, and that the calculation of the l.h.s. of Eqs. (^T[]2^) is 
one of the most time consuming tasks of any QCD simulation. Moreover, 
the combinations 1 ±7^ appearing in ([2l| - p^ are projection operators, which 
effectively limits the U-shift to a subspace of the spin space of dimensionality 
two. Thus a direct implementation of the W-shift, rather than via a combi- 
nation of the .Ushift. and .Gamma, operators, entails substantial advantages 
of efficiency. 

7.7 Module Dirac_operator 

The (Wilson) lattice Dirac operator, acting on a Fermi field /i, produces a 
Fermi field /2, given by 

/2,x = E[(l - 7M)f^x^/i,x+A + (1 + l,)Ul-f.fi,^-f^] (25) 

It is obvious from this equation that the lattice Dirac operator only connects 
components of Fermi fields of opposite parity. The unary operator .Dirac. 
accepts as operand a variable of type fermiJield, which must have a definite 
parity, and returns a variable of the same type and opposite parity given 
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by the action of the lattice Dirac operator ( pSj) on the operand. The unary 
operator .XDirac. implements the action of the Dirac operators bracketed 
by two matrices 75, i.e. , if f is a variable of type fermiJield, . XDirac. f re- 
turns the same results as i5. Gamma.(. XDirac. (iS.Gamma.f)), where the integer 
variable i5 equals 5. 

In this module the operators .Dirac. and .XDirac. are implemented using 
the operators .Wshift. and .Xshift., introduced in the module field_algebra. 
We have defined them as separate operators for convenience of coding and 
also because, the application of these operators being the most CPU intensive 
part for the majority of applications, this module isolates the code whose 
optimization would produce the largest returns. A programmer striving for 
exceptional efficiency might want to code this module as a highly optimized, 
self-standing implementation of the lattice Dirac operator. Even if this route 
is chosen, we are certain that the advantages of having a module written at 
a higher level against which to compare the results of the optimized module 
are not lost on the practicing computational scientist. 

7.8 Module generator _algebra 

This module defines the unary operators .Matrix., .Generator., .Sq. and 
.Exp. which perform some special operations involving generator fields. The 
operators accept arguments of the type generator_field or gaugeJield and 
return as result a variable of one of these types. The parity and dir com- 
ponents of the argument are passed on to the result. 

.Matrix, accepts an argument ge of type generator_field and returns a 
result V of type gaugeJield containing, site by site, the Hermitian matrix 

= J2 ^^ij 9(^k,K ■ (26) 

k 
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.Generator, accepts an argument v of the type gaugeJield and returns a 
result ge of type generatorJield containing, site by site, the traceless, anti- 
hermitian part of the argument: 

gek,^ = -| E 4- - <j,x) • (27) 

Notice that .Generator, and .Matrix, are not inverse operators. However 
.Generator.(.Matrix.(lU*ge)) does return ge, while .Matrix.(. Generator. v) 
returns the traceless, antihermitian part of v: — {v — v^)/{2i). 

.Sq. accepts an argument of type generatorJield gei and returns a result 
ge2 of the same type containing, site by site, the generator corresponding to 
the traceless part of the square of .Matrix.gel: 

{9^2)1,. = ^ E [^.. ( E ^"l (5ei)m,x) ( E Ki {m)n,.)] ■ (28) 

ijk m n 

.Exp. accepts an argument ge of type generatorJield and returns a result 
V of type gaugeJield containing, site by site, the exponentiated generator 
component: 

%,x = [ exp gek,^)]ij . (29) 

k 

The algorithm used for this exponentiation deserves a few words of explana- 
tion. Let us define H ^ J2k >^ij g^k,^, Q = Tr^^ and p = DetH = {TrH^)/3. 
Prom the characteristic equation (recall that TrH — 0) 

H'-^H-pI^Q, (30) 

/ being the identity matrix, satisfied by H and therefore by its eigenvalues 
hn, we can easily calculate the eigenvalues as 

hn — acos[a + 27r{n — l)/3\, n — 1,2,3, (31) 

with a = ^2q/3, a = [cos~^(4p/a^)]/3. We order the eigenvalues so that 
\hi\ > |/i2| > \hs\. 
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In a basis where H is diagonal, it is easy to express it as a linear com- 
bination of two matrices of type "A^ " (one eigenvalue) and "A^ " (two 
degenerate eigenvalues), respectively. Of the six different ways in which this 
can be done we use the decomposition 

H = S + K (32) 

with S = diag(— /ii — 2/^2, hi + 2/^2, 0), = diag(2(/ii + /i2), —hi — h2, —hi — 
h2). 

By using the eigenvalues determined above, it is straightforward to ex- 
press S in the form 

S = CiH + C2{H^-qI/3) . (33) 

Through their dependence on the eigenvalues and Eq. (pTD, however, ci and 
C2 are functions of the invariants q and p only. It follows that Eqs. (^) 
and (|32D provide a decomposition into two matrices of type "A^ " and "A* " 
irrespective of the basis. On the other hand, with a matrix of type "A^ " it is 
straightforward to calculate exp(zS') expressing it as a linear combination of 
/, S and 5*^. Similarly exp{tK) can be expressed as a linear combination of / 
and K. exp{tH) can be finally calculated as product of the two commuting 
matrices exp{tS) and expltK). 

It is very important to have an efficient algorithm for the exponentiation 
of a matrix, since this operation can be a time consuming component of sev- 
eral QCD calculations. The algorithm outlined above has been implemented 
in the module "generator_algebra" by performing a substantial amount of the 
algebra directly in terms of generator components and inlining all of the op- 
erations. The exponentiation can thus be done with a reasonably contained 
number of arithmetic operations, in particular approximately 300 explicit real 
multiplications. By way of comparison, just one product of 3 x 3 complex 
matrices requires 108 real multiplications (i.e. 27 complex multiplications - 
these could also be performed with 81 real multiplications, but then with a 
much larger number of adds). 
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7.9 Module random_numbers 

This module implements a parallelizable version of the unix pseudorandom 
number generator erand48 and provides added functionality. 

erand48 is a congruential pseudorandom number generator based on the 
iterative formula 

Sj+i = ai * Si + bi mod m , (34) 

where Oi = 0x5DEECE66D, bi = OxB, m = 2^*^, Si and Sj+i are integers of at 
least 48 bits of precision. The "seeds" Sj are converted to real pseudorandom 
numbers with uniform distribution between and 1 by = 2~^'^Sj. 

As presented above, the algorithm is intrinsically serial. However it fol- 
lows from Eq. that the N*^ iterate Sj+at is still of the form Sj+at = 
dN * Si + bN mod 2^^ with integers qn and bN which are uniquely determined 
by Oi, bi. The module takes advantage of this fact and of the existence of the 
global variable seeds (cf. global_module) to generate pseudorandom numbers 
in a parallelizable fashion. 

The module defines the following unary operators: .Seed., .Rand., .Gauss, 
and .Ggauss.. 

.Seed, must be used to initialize the generation of pseudorandom numbers. 
When invoked with an argument saved_seed of kind LONG (8-byte integer, 
defined in the module precisions) .Seed, sets the global variable seeds to the 
sequence ( P^ beginning with saved_seed and also sets the global variables 
seed_a, seed_b to the appropriate multiplier and constant term, and b^, 
needed to produce increments by = NX *NY *NZ *NT/2 in the sequence 
of pseudorandom numbers. It returns .TRUE.. 

When acting on a logical variable equal to .TRUE., .Seed, returns the 
current seed (=seeds(0, 0, 0, 0)), which must be used to restart the sequence 
of pseudorandom numbers. If the argument is .FALSE., .Seed, returns 0. 

The unary operator .Rand., if invoked with a real argument real of kind 
REALS, returns a realJield of pseudorandom numbers with uniform distribu- 
tion between and real. At the same time it upgrades the global vari- 
ables seeds using the multiplier a^r (i.e. seed_a) and constant term bpf 
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(i.e. seed_b). It follows that subsequent calls to .Rand, produce real fields 
with the same distribution of pseudorandom numbers which one would have 
obtained invoking erand48 sequentially within nested DO loops: 

DO xyzt=0,NXYZT2-l 

The parity of the results is undefined. 

If .Rand, has an argument of type rcalJicld, it returns a reaLfield of pseu- 
dorandom numbers uniformcly distributed between and the corresponding 
component of the argument. The parity of the result is the same as the parity 
of the argument. 

The unary operator .Gauss, returns a real field of pseudorandom numbers 
with gaussian distribution of mean square deviation equal to the argument 
of .Gauss, and upgrades the global variable seeds. The argument can again 
be a variable of kind REALS or of type realJield and the parity of the result 
is undefined or equal to the parity of the argument, respectively. 

The unary operator .Ggauss. works like .Gauss, but fills with gaussian 
random numbers the components of a generatorJield, setting its direction 
equal to 0. Precisely, the instruction ge = . Ggauss. r, although in the module 
it is implemented differently, would be equivalent to 

DO i=l,8 

auxr=.Gauss.r 

geZf c(i, :,:,:, : )=auxr%f c 

END DO 

ge y„p ar i t y = auxr Xpar i t y 
ge%dir=0 

where auxr is a variable of type realJield 

This module assumes the availabihty of long (8-byte) integers and the 
fact that a multiplication of long integers will return the lowest 8 bytes of 
the product (i.e. a*b mod 2^) without producing an arithmetic error when 
the product exceeds the maximum long integer. If these assumptions are not 
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satisfied, the module should be replaced with some other suitable module. 
Also, we would like to point out that the algorithm of Eq. ( ^4|) produces pseu- 
dorandom numbers of reasonably good quality and period (~ 10^^). However, 
a computer capable of 100 Gflops sustained running a program that makes 
use of one pseudorandom number every thousand floating point operations 
would exhaust the whole set of pseudorandom numbers in one million sec- 
onds, which is not a very long time. Thus for calculations that are very 
computer intensive or which demand pseudorandom numbers of exception- 
ally good quahty, the module should be modified to meet the more stringent 
demands. Two improvements which can be made with minimal changes 
would consist in the use of a larger m in Eq. ( P^ (with appropriate Oi and 
bi) and/or of a reshuffle of the pseudorandom numbers produced by the algo- 
rithm. Of course, one could also make use of the F90 RANDOM_NUMBER 
subroutine, but the results would no longer be machine independent. 

7.10 Module conditionals 

This module defines six overloaded relational operators, >, >=, <, <=, ==, 
/ =, and the operator .Xor. 

The relational operators take as operands two realJields or one realJield 
and one real variable of kind REALS. They return a logical variable which is 
set to .TRUE, if the two fields have the same (defined) parity or if the single 
field operand has defined parity, and is set to .FALSE, otherwise. At the 
same time the global variable context is set to .TRUE, at all sites where the 
relation is satisfied, to .FALSE, at all other sites. For example, the function 
implementing the relational operator a > b, with a and b of type realJield, 
could contain a line: context = a%f c > b%f c , which produces the action 
mentioned above. 

The operator .Xor. accepts as operands a pair of fields of the same type 
and returns a field, also of the same type, having as elements the corre- 
sponding elements of the first operand at the sites where the global variable 
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context is .TRUE., the elements of the second operand at the sites where 
context is .FALSE.. For clarification, the function g_xor_g implementing the 
operation gl.Xor.g2, where gl and g2 are fields of type gaugeJield, would 
contain the code 

DO j=l,3 
DO i=l,3 

WHERE (context) 

g_xor_gyof c(i, j , : ) =giyof c (i , j , :,:,:,:) 

ELSEWHERE 

g_xor_gyof c(i, j ,:,:,:, :)=g2y.fc(i, j ,:,:,:,:) 
END WHERE 
END DO 
END DO 

The parity of the field returned by .Xor. is the common parity of the two 
operands if they have the same parity, otherwise it is undefined. In addition, 
for operands of type gaugeJield, the dir component of the returned field is 
the common dir of the operands if they have the same dir, otherwise it is 
set to 0. 

The operators provided by the module "conditionals" can be very con- 
veniently used to select elements out of two fields according to some local 
condition, an operation which lies at the foundation of stochastic simulation 
techniques. 

7.11 Subroutine write_conf and read_conf 

The file "writc_rcad_conf.f90" contains two subroutines which serve to store 
and retrieve an entire SU(3) gauge field configuration, written in a portable, 
compressed ASCII format. Only the first two columns of the gauge field 
matrices are stored, because the third one can be recovered from the unitarity 
and unimodularity constraints. The write_conf subroutine takes advantage 
of the fact that all of the elements of the gauge field matrices have magnitude 
smaller or equal to 1 to re-express their real and imaginary parts as 48bit 
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integers. These integers are then written in base 64, with the digits being 
given by the ASCII collating sequence starting from 0. Thus, 8 characters are 
needed to represent either the real or the imaginary part of the gauge field 
matrix elements and an entire gauge field matrix is represented by 96 ASCII 
characters, without loss of numerical information. A detailed description of 
the contents of the file generated by write_conf and of the standard used for 
coding the information is written, as a header, at the beginning of the file 
itself. This makes the file with the compressed gauge configuration portable 
and usable, irrespective of the availability of the write_conf and read_conf 
subroutines or of a separate description of their implementation. 

8 Example code 

In order to illustrate the power of the modules we developed, we present here 
two programs which implement a multihit Metropolis simulation of quenched 
QCD and the calculation of a quark propagator. Anybody familiar with the 
complexity of the programs for lattice QCD simulations will appreciate the 
conciseness of our examples. It is also to be noticed that a large amount of the 
code in the programs performs peripheral functions, such as initialization and 
printout. If we consider the Metropolis simulation program, for instance, the 
section of the code which performs the actual upgrading steps consists of only 
28 lines! It is our hope that researchers interested in using our modules will 
find it easy to become familiar with their functionality and that, not being 
hindered by inessential programming burdens, they will thus be able to make 
much faster progress in the development of far-reaching QCD applications. 

8.1 quenched. f90 

! Program Qcdf 90_quenched 

! Copyright by Indranil Dasgupta, Andrea R. Levi, Vittorio Lubicz 
! and Claudio Rebbi - Boston University - May 1996 
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! This program may be freely copied and used as long as this notice 
! is retained. 

PROGRAM Qcdf90_quenched 

USE precisions 
USE constants 
USE global_module 
USE f ield_algebra 
USE generator_algebra 
USE conditionals 
USE shift 

USE random_numbers 

USE assign_mixed 
USE assign_isotypel 

IMPLICIT NONE 

TYPE(gauge_f ield) : : staple, g_old,g_new 
TYPE(real_f ield) : : plaq_old,plaq_new,bf _ratio,rand 
TYPE (generator_f ield) : : ge 
TYPE (matrix) : : zero_matrix, unit_matrix 
LOGICAL l_test,l_seed 

REAL (REALS) clock_dcl , clock_upd , clock_plaq 

REAL (REALS) beta , saved_bet a , hp , av_plaq , aux , range_small , range_unit 

CHARACTER (LEN=64) in_f ilename , out_f ilename 

CHARACTER (LEN=16) id 

INTEGER (LONG) saved.seed, inp.seed 

INTEGER clock_rate , clock.l , clock_2 

INTEGER hot coldread , save , num_upd , p , m , sign , nu , i , hit , num_hit 

! input variables : 

WRITE (*,' ("Lattice size: 415)0 NX, NY, NZ, NT 
WRITE (*,' ("Enter beta: " ) ' , ADVANCE= ' NO ' ) 

READ *,beta 

WRITE (*,' ("Enter number of updates: " ) ^ ADVANCE= ' NO ' ) 

READ *,num_upd 

WRITE (*,' ("Select the starting configuration. Enter for& 
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&a hot start ") ') 

WRITE (*,'("! for a cold start, 2 to read from Disk: " ) ' , ADVANCE= ' NO ' ) 
READ *,hotcoldread 



other useful variables : 
num_hit=6 

raiige_unit=l . _REAL8 
range_sinall=0 . 1_REAL8 
inp_seed=l 

zero_niatrix°/omc=ZERO_m 



number of Metropolis multiple hits 
unitary range for the random numbers 
range for the random numbers 
input seed for random numbers generator 
matrix for initializing the staple 



in_filename= ' configuration. in' 
out_filename=' configuration. out ' 

! initializing system clock 

CALL SYSTEM_CLOCK(clock_l , clock.rate) 
clock_dcl=l . _REAL8/clock_rate 
clock_upd=0 . .REALS 
clock_plaq=0 . .REALS 

! initializing random generator and gauge configuration 
SELECT CASE (hot coldread) 
CASE(O) 

l.seed= . Seed . inp.seed 
DO p=0,l 
DO m=l,4 

ge= . Ggauss . range.unit 
u°/oUC (p , m) = . Exp . ge 
uZuc (p , m) y„parity=p 
uZuc (p , m) yodir=m 
END DO 
END DO 
CASE(l) 

l.seed= . Seed . inp.seed 
unit.matrix%mc=UNIT 
DO p=0,l 
DO m=l,4 

u'/oUC (p , m) =unit .matrix 
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uy„uc (p , m) y„parity=p 

u'/oUC (p , m) y„dir=in 
END DO 
END DO 
CASE(2) 

CALL read_conf (saved_beta, id,hp, saved_seed, in_f ilename) 
IF(inp_seed==0) THEN 

WRITE (*, ' ("saved_seed=",I15) ') saved.seed 

l_seed= . Seed . saved_seed 
ELSE 

l_seed= . Seed . inp_seed 
WRITE (*,'("seed re-initialized")') 
ENDIF 
CASE DEFAULT 

WRITE (*, ' ("hotcoldread must only be 0,1 or 2")') 
STOP 
END SELECT 

DO i=l,num_upd ! Main Loop 

! Metropolis update 

CALL SYSTEM_CLOCK(clock_l) 
DO p=0,l 
DO m=l,4 

! Staple 

staple=zero_matrix 
stapleyoparity=p 
stapleyodir=m 
DO nu=l,4 

IF(nu.EQ.m) CYCLE 

DO sign=-l,l,2 

staple=staple+ ( (nu*sign) . Ushif t . uXuc ( 1-p , m) ) 

END DO 
END DO 

g_old=u%uc (p , m) 
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DO liit=l,num_hit 

plaq_old=g_old . Dot . staple 
ge= . Ggauss . range_small 
geyoparity=p 
ge°/odir=m 

g_new= ( . Exp . ge) *g_old 
plaq_new=g_new . Dot . staple 

bf _ratio= . Exp . (beta/3 . _REAL8* (plaq_new-plaq_old) ) 
rand= . Rand . range_unit 
l_test=rand<bf _ratio 
assign_type='t ' ; g_old=g_new 
END DO 

u%uc (p , m) =g_old 
END DO 
END DO 

CALL SYSTEM_CL0CK(clock_2) 

clock_upd=clock_upd+(clock_2-clock_l)*clock_dcl 

! Plaquette 

CALL SYSTEM_CLOCK(clock_l) 
av_plaq=0 . _REAL8 
DO p=0,l 
DO in=l,3 

DO nu=m+l,4 

aux=uy„uc(p,m) .Dot. (nu.Ushift.uyoUc(l-p,m)) 
av_plaq=av_plaq+aux 
END DO 
END DO 
END DO 

av_plaq=av_plaq/REAL ( 18*NXYZT , REALS) 
CALL SYSTEM_CL0CK(clock_2) 

clock_plaq=clock_plaq+(clock_2-clock_l)*clock_dcl 
WRITE (*,' ("iteration ",I5," av. plaq.= ",F10.6)') i,av_plaq 

END DO ! End Main Loop 

! Save configuration on disk 
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WRITE (*,'("Save configuration on disk ? (Yes=l, & 
&No=0) : , ADVANCE= ' NO O 

READ *,save 
IF(save==l) THEN 

WRITE (*,^ ("saving the configuration") O 

id='conf 0.0.0' 

hp=0 . 

l_seed=.TRUE. 

saved_seed= . Seed . l_seed 

WRITE (*,'(" saved_seed = ",I15)') saved.seed 
CALL write_conf (beta , id , hp , saved_seed , out_f ilename) 
ENDIF 



! Print timing 

WRITE (*,'("Av. upgrade time in microsecs per link" ,F9 . 3) ' ) & 

( 1000000*clock_upd) / (4*NXYZT*num_upd) 
WRITE (*,'("Av. measure time in microsecs per plaquette" ,F9 .3) ' )& 

( 1000000*clock_plaq) / (6*NXYZT*num_upd) 

END 



8.2 propagator. f90 

! Program Qcdf 90_propagator 

! Copyright by Indranil Dasgupta, Andrea R. Levi, Vittorio Lubicz 

! and Claudio Rebbi - Boston University - Jeinuary 1996 

! This program may be freely copied and used as long as this notice 

! is retained. 

PROGRAM Qcdf90_propagator 

USE precisions 
USE constants 
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USE global_module 
USE f ield.algebra 
USE generator_algebra 
USE conditionals 
USE shift 
USE dirac 

USE random_numbers 
USE assign_mixed 
USE assign_isotypel 
USE assign_isotype2 



IMPLICIT NONE 

TYPE(f ermi_f ield) : : psi,chi,grad,h,m_h,nip_in_h 
REAL (REALS) clock.dcl , clock.cg 

REAL (REALS) kappa , tolerance , residue , saved_beta , hp 

REAL (REALS) alpha, old_alpha,g_2,g_old_2,beta_cg, old_beta_cg 

REAL (REALS) h_a_h,norm_psi 

CHARACTER (LEN=64) in.filename 

CHARACTER (LEN= 16) id 

INTEGER (LONG) saved.seed 

INTEGER clock_rate , clock.l , clock_2 

INTEGER iter , nsteps , niter , init_niter , stop_f lag , init_stop_f lag 
INTEGER i,xyzt,s 

! input variables : 

WRITE (*,' ("Enter kappa: " ) ' , ADVANCE= ' NO ' ) 

READ *, kappa 

WRITE (*,' ("Enter max numbers of eg steps: " ) ' , ADVANCE= ' NO ' ) 

READ *, nsteps 

WRITE (*,' ("Enter tolerance: " ) ' , ADVANCE= ' NO O 

READ *, tolerance 

! the conjugated gradient will 
! run until the residue<tolerance 
! or for a maximum of nsteps 



! other useful variables: 

in_filename= ' configuration. in' 
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init_stop_f lag=2 
init_niter=4 

! gauge configuration is read from the disk 

CALL read_conf (saved_beta, id.hp, saved_seed, in_f ilename) 

! the source chi (in the even sites) is set arbitrarly in this example 
DO i=l,3 
DO s=l,4 

DO xyzt=0,NXYZT2-l 

chi%f c (i , xyzt , s) = ( 1 . _REAL8 , 1 . _REAL8) /SQRT (REAL (24*NXYZT2 , REALS) ) 
END DO 
END DO 
END DO 
chiy„parity=0 

! psi must be initialized as the starting trial configuration. 
! the simplest choice is psi=chi 
psi=chi 

! initializing system clock 

CALL SYSTEM_CLOCK(clock_l , clock.rate) 
clock_dcl=l . _REAL8/clock_rate 

! Calculate psi as the solution of: M*psi=chi, 

! where M is the fermion matrix, and chi is a given source. 

! The residue is printed to monitor the convergence. 

stop_f lag=init_stop_f lag 

niter=init_niter 

iter=0 

m_h=psi- (kappa**2) * ( . Dirac . ( . Dirac . psi) ) 

mp_m_h=m_h- (kappa**2) * ( . Xdirac . ( . Xdirac . m_h) ) 

grad=chi-mp_m_h 

g_2=grad*grad 

h=grad 

norm_psi=psi*psi 
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resiclue=g_2/norm_psi 

WRITE (*, ' ("residue= ",F20.16," at step:", 15)') residue, iter 
old_alpha=0 . _REAL8 
old_beta_cg=l . _REAL8 

DO iter=l ,nsteps 

in_h=h- (kappa**2) * ( . Dirac . ( . Dirac . h) ) 

h_a_h=m_h*m_h 

beta_cg=g_2/h_a_h 

psi=psi+beta_cg*h 

norm_psi=psi*psi 

IF(mod(iter,niter)==0 .AND. g_2/norin_psi<tolerance) THEN 
stop_f lag=stop_f lag-1 

m_h=psi-(kappa**2) * ( .Dirac . ( .Dirac .psi) ) 

mp_m_h=m_h~ (kappa**2) * ( . Xdirac . ( . Xdirac . m_h) ) 

grad=chi-mp_m_h 

g_2=grad*grad 

h=grad 

g_old_2=g_2 

residue=g_2/norin_psi 

WRITE (*, ' ("residue= ",F20.16," at step:", 15)') residue, iter 
IF(stop_flag == 0) EXIT 
ELSE 

mp_m_h=ni_h- (kappa**2) * ( . Xdirac . ( . Xdirac . m_h) ) 

grad=grad-beta_cg*mp_m_h 

g_old_2=g_2 

g_2=grad*grad 

alpha=g_2/g_old_2 

h=grad+alpha*h 

norm_psi=psi*psi 

residue=g_2/norin_psi 

IF(mod(iter, niter) == 0) THEN 

WRITE (*, ' ("residue= ",F20.16," at step: ",15)0 residue, iter 
ENDIF 

old_beta_cg=beta_cg 
old_alpha=alpha 
END IF 
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END DO 

CALL SYSTEM_CL0CK(clock_2) 
clock_cg=(clock_2-clock_l)*clock_dcl 

!test solution: 

in_h=psi- (kappa**2) * ( . Dirac . ( . Dirac . psi) ) 

iiip_in_h=iii. .h- (kappa**2) * ( . Xdirac . ( . Xdirac . in_h) ) 

grad=chi-mp_m_h 

norm_psi=psi*psi 

g_2=grad*grad 

residue=g_2/norin_psi 

WRITE (*,' ("final residue= ",F20.16)') residue 

! Print timing 

WRITE (*,'("Cg time per iteration per link in microsecs" ,F9.3) O 

& 

(1000000*clock_cg) / (iter*4*NXYZT) 

END 



8.3 Compilation and sample run output 

The code has been tested and compiled on a SGI PowerChallengeArray with 
90 MHz processor nodes, using IRIX 6.1 Fortran 90, with a single processor 
-03 optimization flags or with the flags -03 -pfa -mp to implement multi- 
processing; on a SGI Indigo using the IRIX 6.1 Fortran 90; and on the IBM 
R6000 58H model 7013 at 55 MHz, with the xlf90 IBM compiler using the 
-03 optimization flags. 

The run of the example programs produce the following outputs when 
running on a single processor of the SGI PowerChallengeArray. 

Output of quenched. f90 

Lattice size: 8 8 8 8 
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Enter beta: 6.0 

Enter number of updates: 15 

Select the starting configuration. Enter for a hot start 
1 for a cold start, 2 to read from Disk: 1 



iteration 


1 


av. 


plaq. 


— 





849923 


iteration 


2 


av. 


plaq. 







773278 


iteration 


3 


av. 


plaq. 







727001 


iteration 


4 


av. 


plaq. 







699791 


iteration 


5 


av. 


plaq. 







677709 


iteration 


6 


av. 


plaq. 







664358 


iteration 


7 


av. 


plaq. 







654980 


iteration 


8 


av. 


plaq. 







645880 


iteration 


9 


av. 


plaq. 







638568 


iteration 


10 


av. 


plaq. 







635049 


iteration 


11 


av. 


plaq. 







631868 


iteration 


12 


av. 


plaq. 







628131 


iteration 


13 


av. 


plaq. 







624450 


iteration 


14 


av. 


plaq. 







621757 


iteration 


15 


av. 


plaq. 







619540 


Save configuration 


on disk ? 


(Yes=l, No 



saving the configuration 

saved.seed = 182618478903297 



Av. upgrade time in microsecs per link 275.533 

Av. measure time in microsecs per plaquette 8.624 

Output of propagator. f90 



Enter kappa: 0.155 

Enter max numbers of eg steps : 2000 
Enter tolerance: l.e-14 



residue= 





2417309784323778 


at 


step: 





residue= 





0047398663491857 


at 


step: 


4 


residue= 





0007379739612745 


at 


step: 


8 


residue= 





0001969957469930 


at 


step: 


12 


residue= 





0000640891236947 


at 


step: 


16 


residue= 





0000235032623593 


at 


step: 


20 


residue= 





0000079006298388 


at 


step: 


24 
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residue= 





0000030559375109 


at 


step 


28 


residue= 





0000014286464825 


at 


step 


32 


residue= 





0000006891018149 


at 


step 


36 


residue= 





0000003502268064 


at 


step 


40 


residue= 





0000002296348520 


at 


step 


44 


residue= 





0000001023143421 


at 


step 


48 


residue= 





0000000357544848 


at 


step 


52 


residue= 





0000000114945632 


at 


step 


56 


residue= 





0000000034230144 


at 


step 


60 


residue= 





0000000010030646 


at 


step 


64 


residue= 





0000000005153800 


at 


step 


68 


residue= 





0000000003507320 


at 


step 


72 


residue= 





0000000002150572 


at 


step 


76 


residue= 





0000000000880532 


at 


step 


80 


residue= 





0000000000289906 


at 


step 


84 


residue= 





0000000000095210 


at 


step 


88 


residue= 





0000000000042035 


at 


step 


92 


residue= 





0000000000016253 


at 


step 


96 


residue= 





0000000000004389 


at 


step 


100 


residue= 





0000000000001207 


at 


step 


104 


residue= 





0000000000000298 


at 


step 


108 


residue= 





0000000000000066 


at 


step 


112 


residue= 





0000000000000022 


at 


step 


116 



final residue= 0.0000000000000022 

Cg time per iteration per link in microsecs 21.548 
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